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METHOD AND APPARATUS FOR DETERMINING MOLECULAR 
CRYSTAL STRUCTURES 

The present invention provides a method and apparatus for 
5 determining molecular crystal structures. In particular, but not exclusively, 
with the present invention the molecular crystal structure of large organic 
molecules, such as pharmaceutical compounds, can be determined using 
data from powder diffraction analysis. 

Information on the molecular crystal structure of a molecule is 

10 usually obtained through irradiation of a single crystal of the molecule with 
neutrons or X-rays. Subsequent analysis of the resultant diffraction 
pattern, which consists of a series of angularly spaced intensity peaks with 
each peak representing an individual Bragg reflection, provides information 
on the structure. Whilst this single crystal diffraction technique is an 

15 effective technique for determining the crystal structure of a molecule, it 
can often prove difficult to grow the single crystals necessary for the 
analysis. Moreover, where the molecule can crystallise in more than one 
polymorphic form, it is sometimes the case that it can prove very difficult to 
grow a single crystal of a particular polymorph. 

20 To address these problems, a powder diffraction analysis was 

developed in which a crystalline powder of the material under analysis is 
irradiated instead of a single crystal. Analysis of the resultant diffraction 
pattern is hampered by the fact that the diffraction pattern may include 
Bragg reflections that partially or fully overlap one another, making it 

25 difficult for individual reflections to be identified, and their associated 

intensities quantified. An example of experimental data from irradiation of 
a powder sample of the drug substance cimetidine in the form of a graph 
representing intensity of the Bragg reflections with respect to angular 
position is shown in Figure 1. This diffraction pattern can be used in a 

30 point-to-point comparison with diffraction data calculated from a postulated 
model of the crystal structure. If there is good agreement between the 
measured and calculated diffraction data, it may be assumed that the 
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postulated structure is close to the true crystal structure of the molecule. In 
general, good agreement is only obtained when there is significant prior 
knowledge of the true crystal structure of the molecule, as there is an 
infinite number of crystal structures that may be postulated and compared 
5 to the experimental data. 

The present invention seeks to address the problems discussed 
above with respect to existing diffraction analysis techniques and seeks to 
provide a method and apparatus for determining molecular crystal 
structures which employs irradiation of crystalline powders and permits 
1 0 analysis without the need for prior knowledge of the approximate crystal 
structure. 

The present invention provides a method for determining molecular 
crystal structures comprising the steps of: determining from powder 
diffraction data for the molecule under examination a unit cell and a space 

15 group; generating a reduced representation of the powder diffraction 

pattern, in which the total quantity of diffraction data is significantly reduced 
whilst maintaining the characteristics of the diffraction data that are 
representative of the crystal structure under examination; defining the 
molecular structure in terms of a plurality of internal co-ordinates; 

20 determining a set of variables for describing trial molecular structures, 
derived from said internal co-ordinates and said space group; assigning 
values to said variables thereby creating a population of trial structures 
each defined by a unique set of values for said variables; calculating a 
fitness for each trial structure with respect to the reduced representation of 

25 the powder diffraction pattern; determining whether any one of the 

calculated fitnesses is less than or equal to a predetermined threshold; 
where none of the calculated fitnesses is less than or equal to the 
threshold value, selecting at least one survivor from the population of trial 
structures, altering the values of the variables of at least one of the 

30 survivors in accordance with one or more predetermined rules, calculating 
the fitnesses of the new trial structures; and repeating the steps of 
selecting survivors, altering the values of the variables and calculating the 
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fitnesses of the new trial structures until at least one of the calculated 
fitnesses is less than or equal to the threshold value, and where at least 
one of the calculated fitnesses is less than or equal to the threshold, 
outputting the at least one trial molecular crystal structures represented by 
5 the successful sets of values. 

In a further aspect the present invention provides apparatus for 
determining molecular crystal structures comprising a structure factor 
analyser for generating from experimental powder diffraction data for the 
molecule under examination a unit cell, a space group and a reduced 
1 0 representation of the powder diffraction pattern in which the total quantity 
of diffraction data is significantly reduced whilst the characteristics of the 
diffraction data representative of the crystal structure under examination 
are maintained; a co-ordinate generator for defining the molecule in terms 
of a set of internal co-ordinates; a controller for determining a set of 
1 5 variables for describing trial molecular structures, derived from said internal 
co-ordinates and said space group; a searching processor for creating a 
population of trial structures each defined by a unique set of values for said 
variables said searching processor including a fitness analyser for 
calculating a fitness for each trial structure with respect to the reduced 
20 representation of the powder diffraction pattern, a thresholding device for 
determining whether any one of the calculated fitnesses is less than or 
equal to a predetermined threshold, a survivor selector for selecting at 
least one survivor from the population of trial structures, a variable 
adjustment device for altering the values of the variables of at least one of 
25 the survivors and output means for outputting the one or more trial 

molecular crystal structures having calculated fitnesses less than or equal 
to the threshold value. 

Ideally, the reduced representation is in the form of a structure 
factor intensity listing and associated covariance matrix. Moreover, 
30 preferably, the fitness x of each of lhe trial structures is determined using 
the following function: 
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X 2 = Zh S k { ( l h -c|F h | 2 ) (V 1 ) hk < l k - c|F k | 2 ) } 

where: 

l h ,k = extracted intensity from the first analyser 
5 Vhk = covariance matrix from the first analyser 
c = a scale factor 

F h ,k = calculated structure factor from trial structure 

The search for a three dimensional structure of a molecule which 

10 would produce a powder diffraction pattern nearly identical to available 

experimental results is greatly simplified with the present invention. This is 
achieved by reduction of trial molecular crystal structures to a unique set of 
variables and by reduction of the experimental powder diffraction data to a 
reduced representation such as a structure factor intensity listing which is 

15 then used in determining a fitness of the trial structure with respect to the 
reduced experimental data. The present invention relies on the fact that at 
its most basic, a molecular crystal structure can be represented by a set of 
internal co-ordinates describing the molecule under investigation, together 
with co-ordinates describing the location and orientation of the molecule 

20 within a unit cell. The reduction of the molecular crystal structure to a set 
of variables enables analysis of the trial structures to be performed much 
more quickly than an analysis performed using the conventional method of 
describing the crystal structure in terms of the fractional or Cartesian co- 
ordinates of every atom in the asymmetric unit of the structure. Such 

25 conventional representations are considered to be unworkable in a model 
building sense because of the computing power necessary to position 
individual atoms independently of each other. The representation of the 
trial structures used in the invention along with the novel fitness function 
means that analyses can be performed in seconds or minutes on the 

30 current generation of conventional personal computers or workstations. 
Moreover, the representation is versatile as it allows the invention to work 
with flexible molecules as well as multiple fragments. 




5 



An embodiment of the present invention will now be described by 
way of example with reference to the accompanying drawings, in which: 

Figure 1 is a graph of experimental data from x-ray powder 
diffraction analysis of cimetidine showing 315 reflections, using an 
5 irradiation wavelength of 1 .5285A and a data range for 20 of 8°-56°; 

Figure 2 is a schematic representation of the 2D molecular structure 
of cimetidine; 

Figure 3 is a flow diagram of the method steps for determining a 
molecular structure in accordance with the present invention; 

1 o Figure 4 is a diagram of the crystal structure of cimetidine; 

Figures 5a, 5b, 5c and 5d are diagrams showing the progressive 
development of a trial crystal structure for cimetidine, employing the 
method and apparatus in accordance with the present invention, overlying 
the diagram of Figure 4; 

15 Figure 6 is a graph showing the fitness of a trial crystal structure for 

cimetidine with respect to generations, employing the method and 
apparatus in accordance with the present invention; and 

Figures 7a, 7b and 7c show the molecular structure of dopamine 
deuterobromide, a graph of the development of a trial structure for the 

20 crystal and a diagram of the solution respectively, employing the method 
and apparatus of the present invention. 

The present invention will be described with reference to an 
experimental determination of the crystal structure of the molecule 
cimetidine, a histamine H 2 antagonist used in the treatment of stomach 

25 ulcers, for which a full single crystal structure (monoclinic Form A) 
determination has already been performed. Figure 2 shows the 2D 
chemical formula of the cimetidine molecule, whilst the known arrangement 
of the cimetidine molecules within the unit cell of the crystal structure is 
shown in Figure 4. 

30 To determine the molecular crystal structure of cimetidine employing 

the method and apparatus of the present invention with reference to Figure 
3, initially a conventional powder diffraction pattern (10) is obtained from a 
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crystalline powder sample of cimetidine. The resultant diffraction pattern is 
shown in Figure 1. The experimental diffraction data (10) is input into a 
cell dimension analyser (12). The cell dimension analyser (12) uses 
conventional techniques to assess the diffraction pattern in order to 
5 determine the unit cell dimensions of the crystal structure. The diffraction 
pattern is also input to a structure factor analyser (14) that also receives 
the unit cell dimensions determined by the analyser (12). The structure 
factor analyser (14) analyses the experimental diffraction pattern using the 
lowest symmetry space group consistent with the crystal class determined 

10 by the cell dimension analyser (12), reducing the data to a first structure 
factor intensity listing and an associated covariance matrix. From this 
listing, the true space group (16) of the crystal structure is determined and 
used by the structure factor analyser (14) to generate a second structure 
factor intensity listing and associated covariance matrix (18). By 

15 generating this second structure factor intensity listing and associated 
covariance matrix (18), the total quantity of the original experimental 
diffraction data is significantly reduced in amount without loss of those 
characteristics of the data representative of the crystal structure under 
examination. Thus, the experimental diffraction data is not presented for 

20 analysis as a point-by-point profile, but rather in a reduced data form 
enabling the data to be analysed using a fitness function described in 
greater detail below. 

Using the known 2D chemical formula for cimetidine (20), a co- 
ordinate generator (22) determines a set of internal co-ordinates (24) which 

25 completely describe the three dimensional structure of the molecule. The 
internal co-ordinates (24) include known data using tabulated bond lengths, 
bond angles and rigid torsion angles, where necessary, along with 
identification of unknown variables such as flexible torsion angles. When 
postulated values for the unknown variables are added, sufficient 

30 information is present in the internal co-ordinates to define the 
conformation of an isolated theoretical cimetidine molecule. 

Preferably, the only unknown factors and so the only variables to be 
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found in the internal co-ordinates are the values of the variable torsion 
angles (represented by variables xt, x 2 , x 3 , x 4i t 5 ...). It is not essential for 
the bond lengths and bond angles to be held fixed and where appropriate 
these factors too may be varied in determining the crystal structure of the 
5 molecule. It has been found though that variation of the bond lengths and 
bond angles within chemically sensible bounds has a much smaller effect 
on the calculated diffraction data than variation of the flexible torsion 
angles within the structure. Thus for most purposes, acceptable results 
can be achieved with these factors held fixed. 

10 The output (24) of the co-ordinate generator (22) is supplied to a 

controller (26) that is also connected to the space group output (16) of the 
structure factor analyser (14). The controller (26) also includes an input 
(28) to enable manual setting of selected operational parameters such as 
the number of trial structures to be analysed in each generation, i.e. the 

15 population size. The controller (26) uses the internal co-ordinates and the 
space group to determine additional variables representing the location 
and orientation of a molecular structure in the unit cell. Preferably, the 
location of the structure within the unit cell is defined using a single 
reference point in fractional co-ordinate space represented by external co- 

20 ordinates or variables (x, y, z). The orientation of the molecule at that point 
may be described using Euler angles (a, fi, y). Alternatively, the orientation 
of the molecule may be described using a quaternion, q. 

In this way the molecular crystal structure is reduced to a set of 
variables consisting of internal and external co-ordinates: 

25 {x, y, z, a, fi, y. x lf x 2 , t 3 . t 4 , t 5 ...} or {x, y, z t g, t 1( x 2 , 13, x 4 , x 5 ...}. 

These variables are suitable for iterative mathematical processing and are 
more amenable to search procedures than the full complement of 
individual atomic co-ordinates used in conventional techniques. 

The output (30) from the controller (26) is then supplied to an 

30 iterative searching processor (32). The output (30) consists of the set of 
variables determined by the controller (26); the complete internal co- 
ordinates produced by the co-ordinate generator (22); operating 
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parameters such as the selected size of the population to be employed in 
the searching procedure; any rules restricting or controlling the values 
which can be allocated to each of the variables; and any rules controlling 
the selection of survivors, the breeding and the mutation of survivors, 
5 described in greater detail below. 

In the method shown in Figure 3, the iterative searching processor 
(32) employs a genetic algorithm to determine the correct molecular crystal 
structure. The above mentioned set of variables {x, y, z, a, (3, y, x 1( x 2 , 13, 
t 4i x 5 ...} or {x, y, z, q, n, x 2 , t 3 , t 4 , t 5 ...} are thus equated to chromosomes, 
10 with each individual variable equating to a gene. The genetic algorithm 
establishes certain protocols based on the concept of 'survival of the 
fittest', with respect to the selection of survivors, the breeding and the 
mutation of survivors. 

Firstly, within the searching processor (32) an initial population of 
1 5 chromosomes is created (33) by assigning random numbers to each of the 
genes of each of the chromosomes. The allowable random numbers for 
any particular gene may be restricted in accordance with rules input from 
the controller (26). The selected size of this initial population depends 
somewhat upon the complexity of the structure under investigation, with 
20 larger population sizes typically being required for problems involving more 
variables. In the case of cimetidine, where seven torsion angles were 
allowed to vary, resulting in thirteen degrees of freedom, a population size 
of 150 was chosen. The fractional co-ordinates (x, y, z) and Euler angles 
are randomly set as real numbers normally bounded by the Euclidian 
25 normalises of the space group. The variable torsion angles (t) are 
typically randomly set as real numbers in the range 0°-360°. 

Using the internal co-ordinates a three dimensional structure of the 
trial molecule is constructed (35) for each parent in Cartesian space, and 
then in fractional space with respect to the crystal unit cell. Diffraction data 
30 is then determined (37) for each of the trial molecular structures and a 
fitness value, is calculated (39) for each trial structure with respect to 
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the structure factor intensity listing and covariance matrix. The preferred 
fitness function employed is as follows: 

X 2 = Z h Z k { ( l h -c|F h | 2 ) (VV ( l k - c|F k | 2 ) } 

5 

where: 

l h k = extracted intensity from the first analyser 
V h k = covariance matrix from the first analyser 
c = a scale factor 
1 0 F h ,k = calculated structure factor from trial structure 

The fitness values for each of the chromosomes is compared to a 
predetermined threshold value (41) so that in the event any one of the 
chromosomes is less than or equal to the threshold value a solution for the 

1 5 molecular structure is output (43). In the event that the fitness values of all 
of the chromosomes exceed the threshold value, the chromosomes are 
then supplied (45) to a survivor selector (47). At the same time a counter 
(49) is increased by one so that a record of the number of generations 
created is maintained. 

20 Using the fitness values obtained for each of the chromosomes, the 

survivor selector (47) employs a proportional selection scheme, in which 
the chances of a chromosome surviving are proportional to its fitness, to 
select a number of survivors. Other criteria for selecting survivors may 
alternatively be used. For example, a tournament selection may be 

25 employed in which case two chromosomes are selected at random and 
compared with one another, with the fittest surviving. In particular the 
Boltzmann tournament may be used as it introduces an element of 
simulated annealing to the selection process. In addition, the selection 
may be elitist with the best member of the population in terms of fitness 

30 always surviving to enter the next generation. 

Additional fitness functions may also be employed instead of, or in 
combination with the aforementioned fitness function, to further enhance 




10 



the analysis of the trial structures. For example, simultaneous fitting of 
both X-ray and neutron diffraction data; use of a molecular packing 
function; use of an isolated molecule Lennard-Jones type calculation; use 
of a rotation / translation function; and use of phase information derived 

5 from direct / Patterson methods. 

Although the above method is described in terms of the entire 
population being subject to a common selection, the population may be 
divided into sub-populations in which each sub-population evolves 
independently of the other sub-populations albeit that migration from one 

10 sub-population to another can be enabled. 

The surviving chromosomes are then used to create offspring (51) 
by allowing the chromosomes to 'breed'. For example, individual genes 
from different chromosome survivors may be mixed and/or one or more of 
the genes in a chromosome survivor or its offspring may be mutated by 

1 5 random selection of a new value for the gene. Often, the population size is 
kept constant throughout this breeding process. The three dimensional 
structure of each of the offspring is then determined (35), as before, and 
theoretical diffraction data calculated (37). 

The fitness (x. 2 ) is then evaluated (39) for each of the offspring and 

20 the fitness results compared (41) to the predetermined threshold value to 
determine whether a likely crystal structure for the molecule has been 
identified. If one of the offspring chromosomes has a fitness value which is 
less than or equal to the threshold value, or if a predetermined maximum 
number of generations has been reached, then the search procedure is 

25 stopped (43). On the other hand, if the fitness functions of the 

chromosome offspring all exceed the threshold value and the counted 
number of generations is less than the maximum allowed number, then the 
offspring are returned for the selection of survivors (47) and for the creation 
of new offspring (51). 

30 Additional rules may also be employed where appropriate to 

constrain the allowable values for the variables. These rules are 
determined by the controller (26) that may utilise data on crystal fragments 
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stored in a memory (53). For example, the controller (26) may search 
through stored crystallographic databases of known crystal structure 
fragments related to the molecule to provide prior information about torsion 
angle values likely to be adopted by the structure. Such information can 
5 then be implemented either as hard limits on the allowable values the 
torsion angles may adopt, or as probability distributions for the torsion 
angles. Furthermore, fragments of the molecule may be located using 
Patterson methods or direct methods. For example, the location of a 
heavy atom may be used to anchor a molecule during the analysis by the 
10 searching processor (32). This effectively reduces the dimensionality of 
the problem by three as the fractional reference co-ordinates are then 
known. 

Operation of the processor (32) in the search for the correct 3D 
molecular crystal structure is thus an iterative procedure with the average 

15 fitness for each generation gradually tending towards the global minimum 
in fitness function space. In Figure 5a, a trial cimetidine crystal structure, 
corresponding to a chromosome in the first generation initialised at random 
by the processor (32), is shown overlying the true crystal structure first 
shown in Figure 3. Figure 5b then shows one of the early offspring 

20 determined by the processor having a fitness value of x. 2= 980, again 
overlying the true crystal structure of cimetidine. In Figure 5c, a later 
offspring having a fitness value of x 2 =430 is shown and the improvement in 
correspondence between the trial crystal structure and the actual crystal 
structure is immediately evident. At this point, the crystal structure could 

25 be refined using a conventional constrained Rietveld refinement. Hence, 
the processor (32) may be arranged so that the threshold value for the 
fitness function is set at around 450. This would result in the search 
procedure being stopped once the trial structure shown in Figure 5c had 
been generated, thereby enabling alternative methods to be used to refine 

30 the fine details of the trial structure. The advantage of stopping the search 
procedure at this point is that, usually, conventional methods will be able to 
refine the fine details of the structure more efficiently than the presently 
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described method and apparatus. 

Continuing with the present method, in Figure 5d an offspring having 
a fitness value of x 2 =1 10 is shown at which point the detail of the trial 
structure is easily refinable. Figure 6 is a graph of trial results for 
cimetidine using the method described above showing the fitness value of 
offspring with respect to the number of generations for both average fitness 
and the best fitness. As can be seen, a refinable structure is obtained 
within a few hundred generations, and an easily refinable structure is 
obtained around 3000 generations. This latter structure corresponds to an 
elapsed time of approximate 40 minutes, with the processor running on a 
single 175MHz R10000 Silicon Graphics™ workstation. 

As further examples for the speed of this method, easily refinable 
structures for pyrene were determined in around 33 seconds, around 15 
seconds for chlorothiazide and 36 minutes for Ibuprofen, with all 
calculations being performed on a single 200MHz Pentium Pro™ personal 
computer. 

The above method and apparatus may also be used with molecular 
structures consisting of more than one fragment. As shown in Figures 7a, 
7b and 7c an easily refinable structure solution for dopamine 
deuterobromide using neutron powder diffraction data was achieved in 
around only 4000 generations. This structure involves not only a dopamine 
cation, but also a separate bromide anion. Using the present method and 
apparatus the location, orientation and conformation of the cation, and the 
location of the anion can be determined simultaneously. 

Whilst in the examples given above the individual genes are real 
numbers, they could equally be represented by binary strings or integer 
approximations with appropriate scaling factors. Also, in the example 
given above the experimental diffraction data is reduced to a structure 
factor listing and associated covariance matrix, it will be apparent that 
alternative ways of reducing the total quantity of diffraction data may be 
employed in which the characteristics of the crystal structure under 
examination are not lost. 
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In the above example a genetic algorithm searching processor is 
employed to perform an iterative selection of candidate molecular crystal 
structures. Alternative iterative analysis processes such as simulated 
annealing, evolutionary strategies and neural network analysis may be 
5 used instead of the genetic algorithm. For example, using a simulated 
annealing process, the same variables that were treated as genes by the 
genetic algorithm are individually adjusted by a small perturbation of their 
current values. If the function value (x* as defined previously) is better 
than before, then the new values of the variables are retained. If the 

10 function value is worse, then the new values of the variables are not 

automatically rejected. Instead the new values may be retained if allowed 
by the temperature dependent Boltzmann selection protocol. In this way, 
'uphill' (in terms of x) adjustment of the variables is permissible, helping 
the algorithm to escape from local minima. The initial choice of the 

15 temperature is usually high to allow large 'uphill* moves if necessary, but 
the temperature is usually lowered in some predetermined fashion during 
the iterative process. Alternatively, the population of the trial structures 
may be set to one and a Monte Carlo procedure followed. 

With the method and apparatus described above, molecular crystal 

20 structures may be solved from powder diffraction data alone. Definition of 
the molecular fragments in terms of internal co-ordinates means that for a 
single molecular fragment, problem complexity scales with the number of 
variable torsion angles rather than with the number of atoms in the 
fragment. Thus, complex structures can be represented by quite short 

25 chromosomes and solved relatively easily. The simple description of 

molecular geometry employed, together with the genetic algorithm analysis 
and the specified fitness function has thus been shown to be particularly 
powerful in determining crystal structures from powder diffraction data in a 
relatively short time frame. 

30 To assist in an understanding of the invention, the method has been 

described with reference to functional, i.e. analyser/processor units. It will 
of course be apparent that in practice the method is implemented as a 
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program on a computer. Indeed, one of the advantages of this method is 
that the program can be implemented on a number of different computer 
architectures, including personal computers and a network of personal 
computers/workstations acting as a parallel computer. 
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